Abstract



Efficient data-flow implementation requires fast run-time mechanisms to detect and dispatch
schedulable tasks. However, the inherent non-determinism in data-flow executions and the
requirement of fast, and therefore simple, run-time mechanisms necessitate compile-time
support to improve performance. In particular, for data-flow execution of applications such
as signal processing, which are characterized by periodically received data, compile-time
support can be used to control the run-time behavior to improve predictability and
efficiency. In this report, a compile-time technique that supports a simple run-time mechanism
to improve throughput and predictability for a task-level data-flow programming model is
described. This technique, called revolving cylinder analysis, restructures the application,
which is described by a task-level data-flow graph. The restructuring is based on wrapping the
projected data-flow execution trace on the curved surface of a cylinder whose area depends
upon the number of processors and the sum of the task execution times. The behavior of
the restructured graph is shown to be more predictable under the same run-time mechanism
than that of the original graph. Results on the performance improvement for two typical signal
processing applications, viz., a correlator and a fast Fourier transform, are presented. The
potential of this approach in determining the optimal granularity for an application is also
described.



1 Introduction 



Data-flow graphs not only describe the dependencies between different parts of the computation
required in an application, but also provide built-in scheduling and synchronization. For
example, on a hypothetical system with no communication cost and an unlimited number
of processors, nodes can synchronize by sending data, and a node can be scheduled as soon
as all the required data is present at its input. Due to the generality of this representation,
it can be used to specify parallelism at the instruction level [Bro87, SFP83] as well as at
the task level [LM87]. The theoretical foundation for the consistency of such representations
has been well studied [KM66, Lee91]. In practical implementations of this paradigm, the
machine must provide mechanisms to manage the data that flows through the graph and
to capture the intrinsic scheduling and synchronization. These mechanisms, typically operating
at run-time, result in overheads that lead to suboptimal performance. The amount
of overhead depends critically on the granularity of the parallelism expressed by the graph
and on whether the computations have conditionals and recursion. A direct implementation
in hardware of the data-flow paradigm for general applications results in unmanageable
overheads [GKW85, Bro87].

However, for classes of applications such as signal processing, data-flow can be managed
very efficiently to obtain significant performance improvement. The two properties of these
applications that make this possible are the availability of a priori knowledge of the amount
of data produced and consumed, and negligible use of conditionals and recursion. When
the amounts of data produced and consumed by the nodes of a data-flow graph are known
exactly, the applications are called synchronous data-flow applications [LM87]. When the
data arrives periodically, they have been classified as pipelined function-parallel computations
[KCN90].

Any data-flow implementation must perform buffering and fetching of data, allocation of 
graph nodes to processors, their ordering on each, and the exact times at which they are 
scheduled. If all the related decisions are done at run-time, the efficiency of the implementa- 
tion suffers. The overheads can be reduced effectively by using the node and arc attributes 
of the data-flow graph at compile-time to simplify the run-time management. 



Based on which decisions are made at compile-time and which ones are made at run-time,
data-flow implementations can be classified over a spectrum that ranges from fully static to
fully dynamic [LB90]. While dynamic implementations have more overhead, they are more
flexible and are easier to implement. They also degrade gracefully in the event of individual
processor malfunction. On the other hand, static implementations are more efficient and lead
to predictable performance, which is crucial to real-time systems. However, they are difficult
to realize, are inflexible, and do not degrade gracefully. Their effectiveness is determined
by how accurately the computational requirements of the application are known. This is
typically a difficult problem, and the usual solution of using the worst-case estimate can result in
large inefficiencies. Therefore, real-time systems must strike a careful balance between
compile-time effort and run-time complexity to get the desired and guaranteed performance.

In real-time signal processing applications, the trade-off between compile-time and run-time
effort has an additional dimension because of the periodic arrival of data. When external
data arrives periodically, the intrinsic non-determinism of data-flow execution results in
unpredictable program behavior. As a result, processed data arrives unpredictably, leading
to the possibility of intolerable delays and insufficient buffer space, especially under high
loads.

The focus of this work is on compile-time mechanisms for controlling data-flow implemented 
using a simple run-time mechanism for real-time signal processing applications. We present a 
technique in which, instead of generating information, such as schedules, to control allocation
or ordering on processors at run-time, a new data-flow graph is obtained as a result of the 
compile-time analysis. The behavior of this new graph is more predictable under the same 
run-time mechanism than that of the old graph. Section 2 describes a model for task-level 
data-flow processing and illustrates the problems associated with fully dynamic data-flow 
execution of real-time signal processing applications. Section 3 describes the proposed ap- 
proach and presents the graph restructuring algorithm. Section 4 describes the effectiveness 
of this approach on two applications using the results of a simulation. Finally, in Section 
5, the potential of graph restructuring and how this approach can be developed further is 
described. 
2 A Model for Task-level Data-flow in Signal Processing



Figure 1 shows the architectural model under consideration for task-level data-flow. This
model closely resembles the AN/UYS-2 parallel signal processor developed by the US Navy
[Ric90]. The model has four basic types of elements, viz., the processors (P), memory modules
(M), scheduler (SCH), and the interconnection network. The processors execute individual
nodes of the data-flow graph. Each processor has a local memory into which the data on all the
input queues as well as the instruction stream corresponding to the node are first fetched.
All input and output queues of the graph¹ are stored in the memory modules. The memory
modules monitor the state of these queues, i.e., whether there is space for additional data and
whether the amount of data has gone above or below certain predetermined threshold and capacity
levels. Changes in the status of a queue are sent to the scheduler. This information is used
by the scheduler to make run-time decisions. Memory modules also store the instruction
streams for all the nodes in the graph. The instruction stream and data are moved between
the processors and the memory modules across the interconnection network. The scheduler
itself is a simple run-time dispatcher that matches the free processors in the free processor list
(FPL) with the ready nodes in the ready node list (RNL). The operation of this architecture
is briefly described below.



2.1 Data-flow Execution 

Applications are specified as data-flow graphs, which are directed acyclic graphs with nodes
representing large-grain computations² chosen from a library of signal processing functions.
The edges of a graph represent queues, which receive data from the source node and supply
data to the destination node. Each queue is allocated to a memory module for storage,
which maintains its current size and its remaining capacity. As data arrives on all the input
queues of a node, the threshold value associated with each queue is eventually exceeded.
Threshold refers to the minimum number of data items that must be present in a queue for its
destination to become ready. A node is ready for execution when two conditions are satisfied.

¹ Unless otherwise mentioned, the term graph always refers to a data-flow graph in the rest of the report.
² Each node can be a complex program.
Figure 1: Model of a Parallel Task-level Data-flow Processor. The scheduler (SCH) maintains the
free processor list (FPL) and the ready node list (RNL). When a processor completes
setting up the task assigned to it, it becomes free. When a node has
all the data available on its input queues, it becomes ready. If there is
a free processor, a ready node is assigned to it. Each memory module
keeps track of the state of the queues assigned to it and sends changes
to the SCH. At any time, a processor may execute a node, set up the next
one, and break down the previous one.
All its incoming queues must exceed their thresholds, and all its output queues must be under their
capacity values. All memory modules communicate the events of threshold/capacity crossing
to the scheduler, which determines if a node is ready. Initially all processors are on the FPL
and the scheduler assigns them to nodes on the RNL. When a node is assigned to a processor,
the processor fetches the data and the instruction stream corresponding to the node from the
appropriate memory modules. When the entire instruction stream and queue data have been fetched,
the setup of the node is complete. A processor communicates this event to the scheduler
to get itself placed on the FPL so that the next node may start getting set up. Thus, the
node already set up begins execution while the next node gets set up, with the restriction that
a processor may have only one node set up and pending execution at any time. The data
generated by the execution is first stored locally. Upon completion, a processor transfers
the data to the appropriate memory modules storing the output queues in what is referred to as
the breakdown phase. Thus, any node goes through three phases at a processor, viz., setup,
execution, and breakdown. Since their functions are independent and the setup/breakdown
operations may require time comparable to the execution time, these operations can be
overlapped by providing an independent execution unit and data movement
unit within a processor.
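
The readiness rule and the FPL/RNL matching described above can be sketched as follows. This is a minimal illustration, not the AN/UYS-2 implementation: the class and function names are hypothetical, and "exceeding the threshold" is read as holding at least the threshold number of items, in line with the definition of threshold given earlier.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Queue:
    size: int        # data items currently stored
    threshold: int   # minimum items needed before the destination may become ready
    capacity: int    # maximum items the queue may hold


@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def is_ready(node):
    """Ready when every input has reached its threshold and every output has room."""
    inputs_ok = all(q.size >= q.threshold for q in node.inputs)
    outputs_ok = all(q.size < q.capacity for q in node.outputs)
    return inputs_ok and outputs_ok


def dispatch(fpl, rnl):
    """Match free processors with ready nodes in first-come-first-served order."""
    assignments = []
    while fpl and rnl:
        assignments.append((fpl.popleft(), rnl.popleft()))
    return assignments


q_in = Queue(size=4, threshold=4, capacity=8)
q_out = Queue(size=8, threshold=1, capacity=8)   # output queue is full
node = Node("x", inputs=[q_in], outputs=[q_out])
print(is_ready(node))                            # blocked by the full output queue
print(dispatch(deque(["P1", "P2"]), deque(["a", "b", "c"])))
```

The dispatcher makes no attempt to pick a good processor or node; it simply pairs list heads, which is the simplicity the run-time mechanism relies on.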

Upon arrival of sufficient data at the nodes which receive data only from the external world, 
an instance of the graph is started and its execution proceeds according to the data-flow
principle. As a result of the data-flow execution, which corresponds to asynchronous task- 
level pipelining, several instances of the graph are active simultaneously. Aside from the 
requirement that the required throughput must be met by the machine, real-time perfor- 
mance may require that all instances of the graph should complete in the same amount of 
time. Between the completion of the setup of a node at a processor and the actual start of its 
execution, there may be a delay because the execution unit at a processor has not completed 
the previous node. This delay, that may be experienced by a ready node, is in addition to 
the delay it may experience waiting on the RNL. Both delays result in an increase in the 
latency of the graph execution. On the other hand, an execution unit may have to wait for 
the setup completion of the next node assigned to it after it completes its current node. If 
this happens, execution cycles are lost and the machine throughput degrades. 

To maximize throughput, all execution units must run all the time, and therefore, each 
processor must have some node set up for execution at the time it finishes the previous node 
computation. Since the scheduler is a simple run-time dispatcher that matches RNL nodes
to free processors, the delays described above depend upon the application execution profile.
This profile depends upon the data rate, the spatial and temporal parallelism in the graph,
the number of processors, the number of memory modules, and the allocation of queues
to memory modules. Since task-level parallelism is being considered, performance can be
improved significantly if setup and breakdown cost can be minimized. One method to reduce
this cost is to chain successive nodes together and execute them on a single processor one
after the other. This results in saving the breakdown cost for the first node and setup cost
for the second node.



2.2 Unpredictability in Program Behavior 

In real-time environments, the ability to predict the program performance is critical for 
efficient allocation of resources such as memory modules, processors, and queue sizes. How- 
ever, the first-come-first-served (FCFS) assignment of processors to ready nodes in the above 
data-flow model is intrinsically non-deterministic. This non-determinism manifests itself as 
degraded performance in two ways, viz., irregular execution patterns and interference at the 
memory modules. 

When data arrives periodically, the unpredictable execution patterns arise due to the absence 
of direct control over execution of nodes that depend only upon the receipt of data from the 
external world. If the output queue capacities for these nodes were unlimited, they would 
execute at a rate that matches the input arrival and is independent of the rate at which 
other nodes execute. In the presence of finite queue sizes, they execute at the input rate
until the output queues get filled; and then, stall until nodes down the graph create space 
in the queues by consuming data. This leads to the individual graph instances not being 
executed in a uniform manner. This is undesirable in real-time scenarios. In addition, the 
machine throughput will degrade because the memory access patterns may be such that 
there is interference at the memory modules while setting up and breaking down nodes. 

This problem of controlled data-flow execution has been addressed in different contexts before.
For example, in [SMS90], input control has been applied to real-time execution of graphs
on multicomputers. In order to achieve predictability, a custom operating environment called
AMOS has been developed. In [SA91], similar unpredictability has been observed due to
the FCFS nature of self-routing of messages in a multicomputer network. The solution
proposed therein is explicit scheduling of the communication resources. In
the following section, a framework is presented that introduces additional dependencies in
the graph based on the technique of revolving cylinder analysis. While only the problem
of controlling execution is addressed in this report, the technique is general enough to be
applied to other problems such as reducing memory contention and determining the
optimal granularity for a given machine configuration.



3 Graph Restructuring Using Revolving Cylinder Analysis



The important resources to be assigned in the model of Fig. 1 are processors and memory 
modules. In this report, we do not address the problem of allocating data queues to memory
modules so that memory contention is minimized. The scheduler assigns processor
resources on a FCFS basis. The key idea in restructuring based on RC analysis is that 
inserting dependencies in the graph can produce a graph with better performance. This idea 
can be traced back to algorithms for overlapping complex operations on pipelined processors 
[RGP82]. This restructuring selectively changes the conditions when a node will enter the 
RNL; however, choosing the processor to schedule it on is left to the run-time dispatcher. 
This enables the actual scheduling to remain dynamic keeping the run-time overhead low. 



3.1 Revolving Cylinder (RC) Analysis

Given a graph as in Fig. 2, it is possible to systematically determine whether it can be
mapped on a certain number of processors while satisfying the required data rates. For
simplicity, we neglect the breakdown and setup times of each node. It can be proved that
the graph can be scheduled (ignoring overheads) such that consecutive graph instances
are separated, on the average, by t steps, where t is equal to the total
execution time of the PGM divided by the number of processors. This corresponds to the
maximal average throughput since the processors will be fully utilized.

Figure 2: A Simple Data-flow Graph

Thus, for the graph of Fig. 2, in which the execution times are shown alongside the nodes, a new
instantiation can be started every 6 (= 12/2) cycles when 2 processors are used. We assume, for
simplicity of explanation, that data arrives at this exact rate, although it is not a necessary
condition for the algorithms discussed later. The graph of Fig. 2 can be modified by inserting delays
as shown in Fig. 3. A schedule for an instance of the modified PGM is shown in Table 1.
Another instance of the modified graph can be overlapped with the first instance after
six clock cycles, and so on. The idea of adding delays to improve overall throughput at
the expense of latency for a single instance has been discussed in the context of hardware
pipelining in [Kog81].
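
The separation t between consecutive instances is simple arithmetic, sketched below. The function name is hypothetical, and the execution times are an assumption: they are the number of cycles each node occupies in Table 1, since the node labels of Fig. 2 are not reproduced here.

```python
def instance_separation(exec_times, processors):
    """Average separation between graph instances: total execution time / processors."""
    return sum(exec_times.values()) / processors


# Execution times (in cycles) as read off Table 1; treated here as an assumption.
exec_times = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 4, "f": 2}

separation = instance_separation(exec_times, processors=2)
print(separation)  # 6.0
```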

For this graph, except for the first 6 processor cycles, which represent a transient, every
subsequent group of six consecutive cycles can be summarized by the schedule in Table 2.
Table 2 can be derived from Table 1 as follows. Assume that there is a cylinder whose
circumference is the intended length of Table 2 (6 in this example) and whose height is the
number of processors (2 in this example). Hence, Table 2 (or any table of size 6 by 2) can be
wrapped around the cylinder such that its end meets its beginning. The line on the surface
of the cylinder that separates the end from the beginning has the effect of a divide-by-C
counter, where C is the circumference, every time it is crossed to enter the beginning from
the end. Now, the first six cycles of Table 1 can be wrapped around the cylinder, then the
second six cycles, and so on until the table is fully wrapped.
The choice of delays in the graph of Fig. 3 and the circumference of the cylinder is such
that when Table 1 is wrapped around the cylinder, no node overlays another
node. Hence, the cylinder mapping is conflict-free. One minor complication in the above
procedure is to assign indices to the nodes on the surface of the cylinder to match those in
Table 2. This is done by initially giving index i to all nodes and subtracting from the
index of a node the number of revolutions taken around the cylinder before it is assigned
its processor cycle(s). This preserves the correctness of the graph since, in our
example, e_1 cannot be started at the same time as a_1, yet e_0 can be.

Figure 3: Example Graph with Delays Inserted

Table 1: A Schedule for One Instance of the Example Graph with Delays

| Cycle # | AP1 | AP2 |
|---------|-----|-----|
| 1       | a   |     |
| 2       | b   |     |
| 3       | c   |     |
| 4       | c   |     |
| 5       | d   | e   |
| 6       | d   | e   |
| 7       |     | e   |
| 8       |     | e   |
| 9       |     | f   |
| 10      |     | f   |

Table 2: Compact Representation of RC Assignment

| Cycle # (i > 1) | AP1 | AP2     |
|-----------------|-----|---------|
| 6i - 5          | a_i | e_{i-1} |
| 6i - 4          | b_i | e_{i-1} |
| 6i - 3          | c_i | f_{i-1} |
| 6i - 2          | c_i | f_{i-1} |
| 6i - 1          | d_i | e_i     |
| 6i              | d_i | e_i     |

Figure 4 illustrates how the entries of the cylinder are indexed. It illustrates that a node can
start and continue across the surface boundary. The execution of a node, X, can be split into
two parts of length a and b as shown. The upper part has index i - 1 because, even though
it is a continuation of the lower part, the index has decreased by one as we go around the
surface once.
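
The wrapping and index assignment can be sketched as below. This is an illustrative reading of the procedure, not code from the report: the function name and data layout are hypothetical, and the single-instance schedule is an assumption taken from Table 1 (with f placed on the second processor, as in Table 2). An index offset of 0 stands for instance i, and -1 for instance i - 1.

```python
def wrap_schedule(schedule, circumference):
    """Wrap a single-instance schedule around a cylinder.

    schedule: list of (node, processor, start_cycle, length), cycles numbered from 1.
    Returns {(slot, processor): (node, index_offset)} with slots 1..circumference.
    """
    cylinder = {}
    for node, proc, start, length in schedule:
        for cycle in range(start, start + length):
            # Each full revolution around the cylinder lowers the index by one.
            revolutions, slot = divmod(cycle - 1, circumference)
            key = (slot + 1, proc)
            if key in cylinder:
                raise ValueError(f"conflict at slot {key}")
            cylinder[key] = (node, -revolutions)
    return cylinder


# Assumed single-instance schedule: (node, processor, start cycle, length).
table1 = [
    ("a", 1, 1, 1), ("b", 1, 2, 1), ("c", 1, 3, 2),
    ("d", 1, 5, 2), ("e", 2, 5, 4), ("f", 2, 9, 2),
]

cylinder = wrap_schedule(table1, circumference=6)
for slot in range(1, 7):
    print(slot, cylinder[(slot, 1)], cylinder[(slot, 2)])
```

A node that crosses the boundary, such as e here, appears with offset 0 before the boundary and offset -1 after it, which is the split illustrated in Fig. 4.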

The above procedure assumes that the cylinder's circumference and the modified graph with
delays on its edges are given. The circumference of the cylinder is equal to the length of Table
2 and is equal to the smallest integer such that a new graph instance can be separated from
the previous one. On the other hand, the delays on the edges are not part of the original
problem and were used for the sake of clarity. In reality, the delays need not be
known a priori. A scheduling algorithm can be devised to take the graph in Fig. 2 and
obtain the cylindrical assignment of Table 2 without using the information given in Fig. 3
or Table 1. This algorithm is given in Fig. 5.

Figure 4: Illustration of Index Assignment
procedure Assign_RC(G, p)                /* G is a directed acyclic graph */
                                         /* p is the number of processors */
    q ← topological_sort(G);             /* O(e), q is a queue */
    for all nodes n_i
        est(n_i) ← 0;                    /* est is the earliest starting time of a node */
    circumference ← 0;
    for all nodes n_i
        circumference ← circumference + w(n_i);    /* w(n_i) is the size of node n_i */
    circumference ← circumference / p;
    while q is not empty
        temp ← remove_top(q);
        t ← schedule_node(temp, est(temp), cylinder);
        for all descendents of temp
            est(descendent) ← max(est(descendent), t + w(temp));
    end(while)

procedure schedule_node(temp, t, cylinder)
    scheduled ← false;
    t' ← t mod circumference;
    while not scheduled
        try to place temp on cylinder surface slot starting at t'
        if inserted
            scheduled ← true;
        else t' ← (t' + 1) mod circumference;
    end(while);
    return t';

Figure 5: An Algorithm to Perform RC Assignment
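
A runnable sketch of the procedure in Fig. 5 is given below. Several details are assumptions rather than part of the report: the edge list of the example graph is hypothetical (the edges of Fig. 2 are not reproduced here), the sketch requires the total weight to be divisible by p, it tries processors in row order, and it gives up after one full revolution instead of looping indefinitely.

```python
from collections import deque


def schedule_node(node, w, t, free, circumference, placement):
    """Place node at the first slot at or after t (mod circumference) with w free cells."""
    if w > circumference:
        raise RuntimeError(f"{node} is longer than the circumference")
    for shift in range(circumference):
        start = (t + shift) % circumference
        cells = [(start + k) % circumference for k in range(w)]  # may wrap around
        for proc, row in enumerate(free):
            if all(row[c] for c in cells):
                for c in cells:
                    row[c] = False
                placement[node] = (proc, start)
                return start
    raise RuntimeError(f"could not place {node}; try scheduling more copies")


def assign_rc(weights, edges, p):
    """Return (circumference, placement), placement[node] = (processor, start_slot)."""
    total = sum(weights.values())
    if total % p:
        raise ValueError("this sketch assumes the total weight is divisible by p")
    circumference = total // p

    succ = {n: [] for n in weights}
    indeg = {n: 0 for n in weights}
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1

    # Topological order (Kahn's algorithm).
    q = deque(n for n in weights if indeg[n] == 0)
    order = []
    while q:
        n = q.popleft()
        order.append(n)
        for d in succ[n]:
            indeg[d] -= 1
            if indeg[d] == 0:
                q.append(d)

    est = {n: 0 for n in weights}          # earliest starting time of each node
    free = [[True] * circumference for _ in range(p)]
    placement = {}
    for n in order:
        t = schedule_node(n, weights[n], est[n], free, circumference, placement)
        for d in succ[n]:
            est[d] = max(est[d], t + weights[n])
    return circumference, placement


weights = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 4, "f": 2}
# Hypothetical dependencies, chosen only to exercise the sketch.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "e"), ("d", "f"), ("e", "f")]
circumference, placement = assign_rc(weights, edges, p=2)
print(circumference, placement)
```

Because the cylinder is cyclic, a node may be placed at a slot earlier than its predecessor's; the index adjustment described earlier is what keeps such a placement consistent with the dependencies.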
The algorithm of Fig. 5 is guaranteed to succeed if all the node execution times are equal; otherwise
there is a chance that it can fail. However, this drawback can be easily remedied as follows.
Assign_RC can be used to schedule k copies of the graph, G, on a cylinder whose circumference
is $(k/p) \sum_i w(n_i)$, and k is iteratively increased until it works. The case of k = p
is guaranteed to work since the circumference then equals the sum of node weights; however,
it is desirable to have k as small as possible.

It should be noted that different schedules which sustain the maximal load could be obtained 
for any graph. Any assignment of nodes on the surface of the cylinder such that no node 
is preempted, and no two nodes are mapped to the same square is valid. The availability 
of multiple schedules which could sustain the same throughput has an important advantage 
with respect to determining the optimal granularity. For example, nodes can be grouped 
together on the surface of the cylinder so as to introduce optimizations to minimize the loss 
of processor cycles due to such overheads as setup and breakdown times or to minimize the 
interference due to memory accesses. 



3.2 Graph Restructuring 



Since the run-time mechanism of the scheduler is fixed, any execution sequence enforcement 
must be accomplished by compile-time techniques. The dashed lines in Fig. 6 show the graph 
of Fig. 2 with the additional data-dependencies used to enforce RC assignment at run-time. 
Each dashed line represents a queue of tokens generated by the source and absorbed by the 
destination. Each source generates a single token when it completes execution. The 2-tuple 
associated with each indicates the threshold and consume amounts for the control token flow 
on these arcs. The threshold amount refers to the number of tokens that must be present on 
the arc for its destination node to be eligible for execution. The consume amount refers to 
the number of tokens removed from the arc when it executes once. Thus, the arc from b to c
forces node c to go on the RNL only after b has completed. Given such restructuring, the setup
and breakdown times for arcs (a, b), (b, d), (a, c), and (e, f) are saved by employing chaining
as described at the end of subsection 2.1. It is assumed that implementing the control-token
queues has an overhead cost that is negligible with respect to the cost of implementing data
queues. It is further assumed that a node can be declared ready if all the data queues
have crossed their thresholds, thus enabling a processor to begin its setup by fetching the
instructions and data associated with it although the control queues have not reached the
threshold. Thus, the control-token queues simply control the execution sequence on each
processor. The algorithm to restructure the graph is given in Fig. 7.

Figure 6: Restructured PGM Graph
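
The threshold and consume amounts on a control-token arc can be illustrated with a small sketch. The class and method names are hypothetical, and the (1, 1) tuple used for the arc from b to c is an assumption, since the labels of Fig. 6 are not reproduced here.

```python
class ControlArc:
    """A control-token queue between a source node and a destination node."""

    def __init__(self, threshold, consume, initial_tokens=0):
        self.threshold = threshold
        self.consume = consume
        self.tokens = initial_tokens

    def source_completed(self):
        # Each source generates a single token when it completes execution.
        self.tokens += 1

    def enabled(self):
        # The destination may execute only once the threshold is reached.
        return self.tokens >= self.threshold

    def destination_executed(self):
        if not self.enabled():
            raise RuntimeError("destination executed before the arc was enabled")
        self.tokens -= self.consume


b_to_c = ControlArc(threshold=1, consume=1)
print(b_to_c.enabled())        # False: b has not completed yet
b_to_c.source_completed()
print(b_to_c.enabled())        # True: c may now execute
b_to_c.destination_executed()
print(b_to_c.enabled())        # False until the next instance of b completes
```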

The restructuring of the graph in the example above is not unique. Since there are several 
ways of filling the table, there is a corresponding set of additional arcs. Even for a single 
assignment, there exist several sets of additional dependencies. This introduces the problem 
of selecting the best assignment and a suitable set of arcs associated with it for an arbitrary 
graph. The criteria that can be used for such selection are minimization of the contention
for resources or the number of additional arcs introduced. 



3.3 Advantages of RC Analysis 

There are several advantages of such node-AP assignment if a compile-time technique can be
found to enforce it on the scheduler run-time mechanism. Compile-time analysis of whether
the machine will meet the required data rate becomes easy. Data-flow execution can be
carried out in a controlled manner, thus improving predictability. Since the nodes are scheduled
relative to each other at compile-time, it becomes possible to take into consideration
the granularity of the graph. Chaining has been mentioned as a technique to minimize the
cost of setup/breakdown of each node. However, unrestrained use of chaining decreases the
amount of parallelism in the application. RC analysis offers a systematic method to determine
the nodes to be chained and the resulting performance gain. For example, although
it is possible to assign nodes in the above example in several ways, the assignment shown
enables chaining nodes a, b, c, and d together and chaining e and f together to minimize the
setup and breakdown overheads. Thus, such an assignment can potentially take into account
the overhead costs while mapping the cylinder. Once it has been determined which nodes
are to be chained, the data queues can be allocated to memory modules so that contention
is minimized.

procedure Restructure_graph(cylinder, circumference, G)
    /* n_r, n_s are nodes of graph G */
    for all nodes n_r
        check index i of n_r
        find the latest node, n_s, that ends before n_r starts on the cylinder
        check index j of n_s
        /* if n_r starts at the top of the cylinder, the latest */
        /* node ends at the bottom of the cylinder. */
        /* In this case, j should be decremented by one */
        introduce a synchronization arc from n_s to n_r
        if i > j
            put i - j initial tokens on the arc
            set threshold = 1, consume = 1
        else if i < j
            put 0 initial tokens on the arc
            set threshold = j - i, consume = 1
    end(for)

Figure 7: Algorithm to Restructure the Graph
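
The restructuring procedure of Fig. 7 can be sketched as follows, under several assumptions that are not stated in the report: the predecessor of a node is searched for on the same processor row only, index values are stored as offsets from i (0 for instance i, -1 for instance i - 1), the case i = j is treated like i > j with zero initial tokens, and the cylinder contents are the assignment of Table 2. All names are hypothetical.

```python
def restructure(cylinder, circumference, processors):
    """Derive synchronization arcs from a cylinder assignment.

    cylinder: {(slot, proc): (node, index_offset)}, slots 1..circumference.
    Returns a list of (src, dst, initial_tokens, threshold, consume).
    """
    arcs = []
    for proc in processors:
        for slot in range(1, circumference + 1):
            cell = cylinder.get((slot, proc))
            if cell is None:
                continue
            node, i = cell
            # Walk backwards (cyclically) to the nearest occupied cell on this row.
            s, crossed, prev = slot, 0, None
            for _ in range(circumference - 1):
                s -= 1
                if s == 0:
                    s, crossed = circumference, 1   # stepped over the boundary line
                prev = cylinder.get((s, proc))
                if prev is not None:
                    break
            if prev is None or prev[0] == node:
                continue  # continuation of the same node, or nothing precedes it
            src, j = prev
            j -= crossed  # predecessor ends at the bottom: decrement j by one
            if i >= j:    # assumption: i == j handled here with zero initial tokens
                arcs.append((src, node, i - j, 1, 1))
            else:
                arcs.append((src, node, 0, j - i, 1))
    return arcs


# Cylinder contents following Table 2 (offset 0 means index i, -1 means i - 1).
cylinder = {
    (1, 1): ("a", 0), (2, 1): ("b", 0), (3, 1): ("c", 0),
    (4, 1): ("c", 0), (5, 1): ("d", 0), (6, 1): ("d", 0),
    (1, 2): ("e", -1), (2, 2): ("e", -1), (3, 2): ("f", -1),
    (4, 2): ("f", -1), (5, 2): ("e", 0), (6, 2): ("e", 0),
}
for arc in restructure(cylinder, 6, processors=[1, 2]):
    print(arc)
```

In this reading, the arc from d to a carries one initial token so that the first instance of a can start without waiting for a previous instance of d, while the arc from b to c carries none.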



4 The Effectiveness of Graph Restructuring 



This section presents simulation results on the usefulness of graph restructuring for controlled 
data-flow execution of two typical signal processing applications. The correlator graph is a 
simple application while the fast Fourier Transform is a communication intensive graph. The 
predictability is modeled as the non-uniformity in the interval between two successive graph 
instance completions. This non-uniformity is observed as the interval between successive 
input data sets is varied up to the maximum possible on an ideal machine for the given 
graph. As mentioned previously, the minimum input data period is obtained by summing 
the task execution times and dividing by the number of processors. The plots in the next 
section are obtained by plotting the input data periods normalized by this maximum on the 
horizontal axis. The quantities plotted on the vertical axes are the arithmetic mean of graph
instance completion times, the standard deviation among the completion times, and the %
application processor (AP) efficiency.

The instance completion times are normalized with respect to the input arrival period. Thus,
the normalized instance completion time should be unity on ideal machines that meet the 
application requirements for any input rate. Ideally, a value greater than unity indicates 
that the machine cannot meet the application rates. However, in this case, due to a finite 
sized window of observation of the application behavior, the average values plotted are 
approximately unity when the machine meets the application requirements. 
The plots of standard deviation between the instance completion times give a better idea of
the non-uniformity in execution. The input period is used to normalize the difference between
the completion time of an instance and the ideal completion time. The % efficiency indicates
the time for which the execution unit at the application processor was busy performing useful
computation and not waiting for data.
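
One plausible way to compute these three quantities from a simulation trace is sketched below. The function name, the argument layout, and the use of the population standard deviation over inter-completion intervals are assumptions; the report does not specify the exact estimator.

```python
from statistics import mean, pstdev


def completion_metrics(completion_times, input_period, busy_time, elapsed_time):
    """Return (normalized mean, normalized std. dev., % efficiency).

    completion_times: completion instants of successive graph instances.
    busy_time / elapsed_time: execution-unit busy time over the observation window.
    """
    intervals = [b - a for a, b in zip(completion_times, completion_times[1:])]
    normalized_mean = mean(intervals) / input_period
    normalized_std = pstdev(intervals) / input_period
    efficiency = 100.0 * busy_time / elapsed_time
    return normalized_mean, normalized_std, efficiency


# Perfectly uniform completions at the input period give a mean of 1 and no deviation.
uniform = completion_metrics([10, 16, 22, 28], input_period=6,
                             busy_time=90, elapsed_time=100)
print(uniform)
```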



4.1 Correlator Application 

This graph was chosen to represent a simple, yet realistic, signal processing application. The
corresponding graph appears in Fig. 8 [Tec90b]. The circles indicate the nodes to be executed
and the arrows represent the queues holding the data required by the nodes. "T" represents
the threshold value required before the destination node becomes ready. "R" represents
the amount that is read by the destination node on execution setup. "C" represents the
amount that is consumed on destination node breakdown. "P" represents the production
amount from the previous node. Actual execution times for the primitives listed beside the
nodes were obtained from the signal processing primitives library [Tec90a]. The application was
simulated assuming five processors and five memory modules.

The points plotted for the correlator graph were taken at
5% intervals, except in the region of close similarity, where the interval was 1%. The results
for the normalized mean are shown in Figs. 9 and 10. While the difference between FCFS and
RC is not discernible in Fig. 9, Fig. 10 clearly indicates that the RC algorithm reaches
unity 5% before the FCFS algorithm. At all times the RC curve remains below the FCFS
curve. The normalized standard deviation, shown in Fig. 11, indicates that the
RC algorithm provides a more uniform output than does the FCFS algorithm throughout
the range of input data periods. Due to the dependencies inserted by the RC algorithm,
the processor efficiency is lower for the RC case than for the FCFS case until uniformity in
output is obtained, as shown in Fig. 12. This result is caused by the dependencies inhibiting
the earlier nodes in the graph from executing until they are satisfied. While the efficiency is
slightly lower for the RC approach, the lower normalized mean and standard deviation results
indicate an improvement by use of the RC algorithm over the FCFS scheduling technique.
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Exiamai Input Queue 
T-R-C-1638^ 

1: RXFLV-5000 
T-R-C-16364 
3: BAND1 - 15000 

T-R-16423.C-16384 
5: RRl - 10000 



T-R-C-4096 



8: FFT1 - *100000 

T-R-C-4096 

10: WINDOW1 -40000 

T-R-C-4096 
12: MULTXY-7500 
T-R-C-4096 
15: INVERSEFrT- 

100000 



T-R-C-513..P^2052 

7-R-0513 

18: EXPAVG-5000 
T-R-0513 



20: ASCANOUT- 10000 
External Output Queue 




External Input Queue 
T-R-C-16384 
^ RXFL2-5000 
T-R-C- 16384 
4: BAND2- 15000 

T-R-16423.C-16384 
6: F1R2- 10000 

T-P-C-4096 
7: 2ERORLL - 5000 

T-R-C-4096 
9: FrT2- 100000 

T— 

11: W1NDOW2-40000 



T— 

13: POWERX. 100000 

14; POWERY • 100000 
T-R-C-4.T-R-C-4 
16: MULTPWR.SQRT - 
5000 

T-R-C-1.P-4 

17: INTEGRATE- 

20000 

T-R-0513 

19: GRAMOirr- 10000 
T-R*^-513 

Exiamai Output Queue 



Figure S: Data-flow Graph for the Correlator Application 



Figure 9: Correlator Graph - Mean Instance Completion Times 
(FCFS mean dashed, RC mean solid; x-axis: normalized input data interval) 

Figure 10: Correlator Graph - Blow-up of Mean Instance Completion Times 
(FCFS mean dashed, RC mean solid; x-axis: normalized input data interval, 0.7 to 0.86) 

Figure 11: Correlator Graph - Standard Deviation 
(FCFS std. dev. dashed, RC std. dev. solid; y-axis: normalized standard deviation) 

Figure 12: Correlator Graph - Processor Efficiency 
(FCFS dashed, RC solid; y-axis: % AP efficiency) 
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4.2 Fast Fourier Transform Data-flow Graph 



The fast Fourier Transform (FFT) algorithm was chosen to examine the effects of the RC 
analysis on a communication intensive graph. The graph for a two dimensional (2-D) FFT can 
be represented in terms of that of a one dimensional (1-D) FFT. This application assumes a 
256 point vector of inputs. The 1-D FFT can be calculated in log2 256 = 8 stages of operations 
with 128 operations per stage. Each stage can be divided into p parallel tasks, with 128/p 
operations per task. As the tasks in stage i finish, they send their outputs to the tasks in 
stage i + 1. The data-flow graph for a 2-D FFT uses 2 log2 256 = 16 stages to transform a 
256 x 256 matrix of inputs. 256 1-D FFTs are computed for rows followed by another 256 1-D 
FFTs for columns. Tasks in the first 8 stages perform 1-D FFTs on all 256 rows, with each 
task performing (256 x 128)/p operations. Tasks in stage 8 send data to tasks in stage 8 + 1 
in such a way that the second set of 8 stages performs 256 column transforms. The numbers 
beside the queues represent queue over threshold, production, and consume values in 
micro-seconds. The 2-D FFT graph is shown in Fig. 13. 
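
The stage and task arithmetic above can be checked with a short sketch. It assumes that each stage has n/2 operations (128 for n = 256) and that the operations of a stage are divided evenly among p tasks. The function name, the dictionary keys, and the default p = 4 are hypothetical; p = 4 is used only because it would yield the 64 nodes named in the FFT plot titles, and the text itself does not state p.

```python
def fft_graph_shape(n_points=256, tasks_per_stage=4):
    """Stage, node and per-task operation counts for the 1-D and 2-D FFT graphs.

    Assumptions: n_points is a power of two, each stage has n_points / 2
    operations, and every stage is split evenly into tasks_per_stage tasks.
    """
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two")
    if tasks_per_stage < 1:
        raise ValueError("tasks_per_stage must be at least 1")

    stages_1d = n_points.bit_length() - 1   # log2(n_points): 8 for 256 points
    ops_per_stage = n_points // 2           # 128 operations per 1-D stage

    return {
        "stages_1d": stages_1d,
        "stages_2d": 2 * stages_1d,         # row pass followed by column pass
        "nodes_2d": 2 * stages_1d * tasks_per_stage,
        "ops_per_task_1d": ops_per_stage / tasks_per_stage,
        # In the 2-D graph each stage covers all n_points rows (or columns).
        "ops_per_task_2d": n_points * ops_per_stage / tasks_per_stage,
    }
```

With the defaults, the sketch gives 8 stages for the 1-D FFT, 16 stages and 64 nodes for the 2-D graph, and 32 and 8192 operations per task respectively.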



This data-flow graph was simulated on a machine with 8 processors and 8 memory modules. 
The normalized mean for the FFT graph is shown in Figs. 14 and 15. Here also, when RC-based 
restructuring is used, the input data rate is met 5% before it is met with the FCFS 
algorithm. Due to the high communication overhead as compared to the previous graph, the 
input rate that can be met by this machine is lower. The normalized standard deviations are 
shown in Figs. 16 and 17. Again, the RC standard deviation clearly outperforms the FCFS 
standard deviation throughout the spectrum of input data rates. The normalized standard 
deviation is consistently less than 0.5 regardless of load level. This implies that a much 
more uniform output results from the RC algorithm regardless of load. Figure 18 shows the 
differences in processor efficiency for the FFT graph. The low values are caused by the 
communication overhead involved in processing this type of graph. The restructured graph 
yields a greater processor efficiency due to the assigned dependencies limiting the data 
movement traffic. 

Figure 13: 2-D FFT Data-flow Graph 

Figure 14: FFT Graph - Mean Instance Completion Times 
(64 node graph; FCFS mean dashed, RC mean solid; y-axis: normalized mean) 

Figure 15: FFT Graph - Blow-up of Mean Instance Completion Times 
(64 node graph; FCFS mean dashed, RC mean solid; x-axis: normalized input data interval; 
y-axis: normalized mean) 

Figure 16: FFT Graph - Standard Deviation 
(64 node graph; FCFS std. dev. dashed, RC std. dev. solid; y-axis: normalized standard 
deviation) 

Figure 17: FFT Graph - Blow-up of Standard Deviation 
(64 node graph; FCFS std. dev. dashed, RC std. dev. solid; x-axis: normalized input data 
interval; y-axis: normalized standard deviation) 

Figure 18: FFT Graph - Processor Efficiency 
(64 node graph; FCFS dashed, RC solid) 

5 Concluding Remarks and Future Research 



In conclusion, the major contribution of this work has been to present a compile-time 
approach to enable efficient use of the data-flow paradigm in real-time applications with 
periodic arrival of data. We have shown that the proposed approach of RC analysis provides 
a framework in which optimizations related to data-flow execution at the task-level can be 
carried out. In order to control the execution when input data arrives periodically, this 
technique restructures the application graph into one that has a more predictable behavior 
under the same run-time mechanism. The results have been presented using typical 
applications, viz., the correlator and FFT graphs. They show that this approach does make 
the individual instance completion time more uniform regardless of the input period and the 
communication overhead. 

Currently, the following issues with regard to the use of compile-time data-flow graph analysis 
are being investigated. 






• Chaining of nodes results in saving the breakdown and setup overhead. However, 
unrestrained chaining results in loss of parallelism and could be detrimental to processor 
efficiency. It is difficult to predict the effect of chaining two nodes for an FCFS execution; 
but if chaining is specified within the framework of RC analysis, its effect can be 
accurately predicted. 

• Given a specific assignment, it is known which queues are accessed at the same time. 
This information can be used to algorithmically assign memory modules to queues, so 
that the interference between nodes at a module is minimized. 

• There are several ways in which the additional dependencies can be introduced. The 
criteria to select the minimal set of dependencies to be introduced that provide the 
minimal, yet effective, control of the execution are being developed. 
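
As an illustration of the second item above, one possible heuristic for assigning memory modules to queues is sketched below. It is not the method under investigation, which the report does not specify. It assumes that the analysis yields a weight for each pair of queues accessed at the same time, and it greedily places each queue on the module where it conflicts least with the queues already placed. All names are hypothetical.

```python
def assign_modules(queues, conflicts, n_modules):
    """Greedily assign queues to memory modules.

    conflicts maps a pair (qa, qb) to a weight describing how often the two
    queues are accessed at the same time (an assumed input).
    Returns a dict {queue: module index}.
    """
    if n_modules < 1:
        raise ValueError("n_modules must be at least 1")

    # Build a symmetric view of the conflict weights.
    weight = {}
    for (a, b), w in conflicts.items():
        weight.setdefault(a, {})[b] = w
        weight.setdefault(b, {})[a] = w

    # Place the most heavily conflicting queues first.
    order = sorted(queues, key=lambda q: -sum(weight.get(q, {}).values()))

    assignment = {}
    for q in order:
        cost = [0] * n_modules
        for other, w in weight.get(q, {}).items():
            if other in assignment:
                cost[assignment[other]] += w
        # Choose the cheapest module; ties go to the lowest index.
        assignment[q] = min(range(n_modules), key=lambda m: (cost[m], m))
    return assignment


def interference(assignment, conflicts):
    """Total weight of conflicting queue pairs that share a module."""
    return sum(w for (a, b), w in conflicts.items()
               if assignment[a] == assignment[b])
```

A greedy placement of this kind does not guarantee minimal interference in general; it only shows how the simultaneous-access information could drive the assignment.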
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